Describe variation in spawn timing and how it relates to environmental covariates.
This study was conducted in the Middle Fork of the Salmon River (MFSR) in central Idaho (Fig. 1).
Spawn timing data for Chinook salmon were collected from 2001 to 2005 in the MFSR. We removed data from 2001, and data from Knapp Creek and Cape Horn Creek, as these sites were not consistently sampled.
We consider the day of year a redd is assumed to be complete as our response variable (spawn timing).
Each redd location was assigned a COMID based on the NHD, which is analogous to a stream reach. The COMID is used to link the spawn timing data with covariate data associated with the stream reach on which it is located.
# Response data ---------------
spawn_data <- read_csv(here("data", "russ_spawn", "mfsr_spawn_cleaned.csv"))
# remove bad data
spawn_data <- spawn_data |>
filter(stream != "Knapp" & stream != "Cape Horn" & year != 2001) |>
mutate(year = as.factor(year), stream = as.factor(stream), COMID = as.factor(COMID))
glimpse(spawn_data)
## Rows: 3,016
## Columns: 18
## $ UNIQUE_ID <dbl> 1295, 1296, 1298, 1299, 1300, 1301, 1302, 1303, 1304, 1305,…
## $ COMID <fct> 23519365, 23519365, 23519319, 23519319, 23519319, 23519317,…
## $ REACH <chr> "BEARVAL1", "BEARVAL1", "BEARVAL1", "BEARVAL1", "BEARVAL1",…
## $ GNIS_NAME <chr> "Bear Valley Creek", "Bear Valley Creek", "Bear Valley Cree…
## $ OBS_CLASS <chr> "MONITOR", "MONITOR", "MONITOR", "MONITOR", "MONITOR", "MON…
## $ OBS_NAME <chr> "RN", "RN", "RN", "RN", "RN", "RN", "RN", "RN", "RN", "RN",…
## $ DATE <chr> "08232002", "08232002", "08232002", "08232002", "08232002",…
## $ FEATURES <chr> "26", "27", "23", "24", "25", "22", "57", "71", "56", "72",…
## $ Feature_ID <chr> "26", "27", "23", "24", "25", "22", "57", "71", "56", "72",…
## $ OBJ_CLASS <chr> "Redd", "Redd", "Redd", "Redd", "Redd", "Redd", "Redd", "Re…
## $ EASTING <dbl> 629674, 629674, 629696, 629696, 629696, 629731, 629731, 629…
## $ NORTHING <dbl> 4918590, 4918590, 4918716, 4918716, 4918716, 4918801, 49188…
## $ Comments <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ date <date> 2002-08-23, 2002-08-23, 2002-08-23, 2002-08-23, 2002-08-23…
## $ year <fct> 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002,…
## $ yday <dbl> 235, 235, 235, 235, 235, 235, 238, 246, 238, 246, 246, 238,…
## $ month <dbl> 8, 8, 8, 8, 8, 8, 8, 9, 8, 9, 9, 8, 8, 9, 9, 8, 9, 9, 9, 9,…
## $ stream <fct> Bear Valley, Bear Valley, Bear Valley, Bear Valley, Bear Va…
# length(unique(spawn_data$COMID)) # 104
# length(unique(spawn_data$stream)) # 8
# length(unique(spawn_data$year)) # 4
The data comprise 3016 redd observation from 8 streams across 4 years. The redds were observed between day 206 and 263.
To test for environmental factors driving variation in spawn timing, we quantified associations with metrics describing thermal and physical conditions in stream reaches.
We selected covariates based on the following criteria: (1) they are known to influence spawn timing, (2) they are available for all streams, and (3) they are not highly correlated with each other.
Our focal independent variable were the average daily flow (cfs) and stream temperature (°C) during the 30, 60, and 90 days prior to spawning. We also included the slope of the stream reach and the mean elevation of the stream reach.
# make a df of just response and covariates
model_data <- combined_data |>
filter(SLOPE < .2) |> # bad slope data
select(
yday, COMID, stream, year,
flow_30, flow_60, flow_90,
temp_30, temp_60, temp_90,
SLOPE, mean_elevation
)
glimpse(model_data)
## Rows: 3,013
## Columns: 12
## $ yday <dbl> 235, 235, 235, 235, 235, 235, 238, 246, 238, 246, 246, …
## $ COMID <fct> 23519365, 23519365, 23519319, 23519319, 23519319, 23519…
## $ stream <fct> Bear Valley, Bear Valley, Bear Valley, Bear Valley, Bea…
## $ year <fct> 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2002, 2…
## $ flow_30 <dbl> 650.8333, 650.8333, 650.8333, 650.8333, 650.8333, 650.8…
## $ flow_60 <dbl> 997.4333, 997.4333, 997.4333, 997.4333, 997.4333, 997.4…
## $ flow_90 <dbl> 1992.733, 1992.733, 1992.733, 1992.733, 1992.733, 1992.…
## $ temp_30 <dbl> 12.70318, 12.70318, 14.02703, 14.02703, 14.02703, 13.78…
## $ temp_60 <dbl> 12.87280, 12.87280, 14.20337, 14.20337, 14.20337, 13.95…
## $ temp_90 <dbl> 11.07830, 11.07830, 12.15472, 12.15472, 12.15472, 11.92…
## $ SLOPE <dbl> 0.00299158, 0.00299158, 0.00171806, 0.00171806, 0.00171…
## $ mean_elevation <dbl> 1951.730, 1951.730, 1946.735, 1946.735, 1946.735, 1944.…
We evaluated variation in spawn timing using histograms, boxplots, density plots. We also calculated summary statistics for spawn timing.
model_data |>
select(temp_30, temp_60, temp_90, flow_30, flow_60, flow_90, mean_elevation) |>
GGally::ggpairs()
Check VIFs with and without flow_30 and
flow_60:
car::vif(lm(yday ~ flow_30 + flow_60 + flow_90 + temp_90 + mean_elevation + stream + year, data = model_data))
## GVIF Df GVIF^(1/(2*Df))
## flow_30 54.18575 1 7.361097
## flow_60 42.25228 1 6.500176
## flow_90 13.47150 1 3.670354
## temp_90 15.37376 1 3.920938
## mean_elevation 21.09913 1 4.593379
## stream 64.19481 7 1.346192
## year 54.26889 3 1.945771
car::vif(lm(yday ~ flow_90 + temp_90 + mean_elevation + stream + year, data = model_data))
## GVIF Df GVIF^(1/(2*Df))
## flow_90 9.605784 1 3.099320
## temp_90 14.146143 1 3.761136
## mean_elevation 20.418174 1 4.518647
## stream 53.393667 7 1.328594
## year 8.676807 3 1.433486
Check simple models for temp to examine functional structure:
m1 <- lm(yday ~ temp_90, data = model_data)
m2 <- lm(yday ~ temp_90 * stream, data = model_data)
m3 <- lm(yday ~ temp_90 * stream + poly(temp_90, 2), data = model_data)
m4 <- lm(yday ~ poly(temp_90, 2), data = model_data)
m5 <- gam(yday ~ s(temp_90, k = 5), data = model_data, method = "REML")
m6 <- gam(yday ~ s(temp_90, by = stream, k = 5), data = model_data, method = "REML")
AIC(m1, m2, m3, m4, m5, m6) |>
arrange(AIC) |>
mutate(delta = AIC - min(AIC)) |>
kableExtra::kbl() |>
kableExtra::kable_styling("striped", full_width = F)
| df | AIC | delta | |
|---|---|---|---|
| m6 | 32.383524 | 15894.98 | 0.0000 |
| m3 | 18.000000 | 16192.42 | 297.4428 |
| m2 | 17.000000 | 16547.05 | 652.0709 |
| m5 | 5.985751 | 17629.35 | 1734.3790 |
| m4 | 4.000000 | 17640.10 | 1745.1201 |
| m1 | 3.000000 | 17983.29 | 2088.3110 |
m11 <- lm(yday ~ temp_90, data = model_data)
m22 <- lm(yday ~ temp_90 * year, data = model_data)
m33 <- lm(yday ~ temp_90 * year + poly(temp_90, 2), data = model_data)
m44 <- lm(yday ~ poly(temp_90, 2), data = model_data)
m55 <- gam(yday ~ s(temp_90, k = 5), data = model_data, method = "REML")
m66 <- gam(yday ~ s(temp_90, by = year, k = 5), data = model_data, method = "REML")
AIC(m11, m22, m33, m44, m55, m66) |>
arrange(AIC) |>
mutate(delta = AIC - min(AIC)) |>
kableExtra::kbl() |>
kableExtra::kable_styling("striped", full_width = F)
| df | AIC | delta | |
|---|---|---|---|
| m33 | 10.000000 | 17283.11 | 0.0000 |
| m66 | 17.942638 | 17327.23 | 44.1178 |
| m22 | 9.000000 | 17459.41 | 176.2954 |
| m55 | 5.985751 | 17629.35 | 346.2407 |
| m44 | 4.000000 | 17640.10 | 356.9817 |
| m11 | 3.000000 | 17983.29 | 700.1726 |
Check simple models for flow to examine functional structure:
# compare linear, quadriatic, and gam model
m1 <- lm(yday ~ flow_90, data = model_data)
m2 <- lm(yday ~ flow_90 * year, data = model_data)
m3 <- lm(yday ~ flow_90 * year + poly(flow_90, 2), data = model_data)
m4 <- lm(yday ~ poly(flow_90, 2), data = model_data)
m5 <- gam(yday ~ s(flow_90, k = 5), data = model_data, method = "REML")
m6 <- gam(yday ~ s(flow_90, by = year, k = 5), data = model_data, method = "REML")
AIC(m1, m2, m3, m4, m5, m6) |>
arrange(AIC) |>
kableExtra::kbl() |>
kableExtra::kable_styling("striped", full_width = F) |>
kableExtra::add_header_above(c(" " = 1, "AIC" = 2))
| df | AIC | |
|---|---|---|
| m6 | 17.999597 | 7964.960 |
| m3 | 10.000000 | 9654.838 |
| m2 | 9.000000 | 10061.171 |
| m5 | 5.998777 | 18324.766 |
| m4 | 4.000000 | 18455.969 |
| m1 | 3.000000 | 18662.693 |
# sjPlot::tab_model(m1, m2, m3, m4, m5, m6)
# summary(m2)
# prediction grid
pred_data <- model_data %>%
group_by(year) %>%
summarize(min_flow = min(flow_90), max_flow = max(flow_90)) %>%
rowwise() %>%
do(data.frame(year = .$year,
flow_90 = seq(.$min_flow, .$max_flow, length.out = 100))) %>%
ungroup()
# Predict
pred_data$yday_pred_m2 <- predict(m2, newdata = pred_data)
pred_data$yday_pred_m3 <- predict(m3, newdata = pred_data)
pred_data$yday_pred_m6 <- predict(m6, newdata = pred_data)
model_data |>
ggplot(aes(x = flow_90, y = yday, color = year)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "grey1") +
geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE, color = "grey50") +
geom_smooth(method = "gam", formula = y ~ s(x, k = 5), method.args = list(method = "REML"), se = FALSE, color = "grey90") +
geom_line(data = pred_data, aes(x = flow_90, y = yday_pred_m2), size = 1.2) +
geom_line(data = pred_data, aes(x = flow_90, y = yday_pred_m3), size = 1.2) +
geom_line(data = pred_data, aes(x = flow_90, y = yday_pred_m6), size = 1.2) +
theme_custom() +
labs(title = "Model fits",
subtitle = "",
y = "Day of Year",
x = "Flow (90-day mean pre-spawn)")
Including flow_90 could introduce spurious precision and possibly overfitting. Why flow_90 might be problematic:
Bottom Line: Unless we have site-specific or spatially disaggregated flow data, flow_90 is probably not a valid covariate for redd-level models.
Including it may:
Recommendation: Drop flow_90 from model (or at most, keep it as a year-level covariate if you summarize it to a single annual value for all observations)
Although we initially considered including 90-day mean streamflow
(flow_90) as a predictor of spawn timing, this variable was
ultimately excluded due to concerns about ecological validity and model
overfitting. Streamflow data were derived from a single downstream USGS
gauge and did not capture spatial variation across the study streams or
reaches. Moreover, because flow_90 was closely aligned with year, it
introduced strong collinearity with the year effect and risked
attributing site-level variation to flow patterns not actually
experienced by individual redds. As such, we excluded flow_90 to avoid
misleading inference.
Final dataset includes:
# select variables and standardize continuous covariates
scale2 <- function(x, na.rm = FALSE) (x - mean(x, na.rm = na.rm)) / sd(x, na.rm)
# model_data_final <- model_data |>
# select(yday, COMID, stream, year, temp_90, mean_elevation)
model_data_final <- model_data |>
select(yday, COMID, stream, year, temp_90, mean_elevation) |>
mutate(across(c("temp_90", "mean_elevation"), scale2))
We should only include a grouping variable as both fixef and ranef when you want to model differenct aspects of the effect. E.g., stream as fixed to compared average effects by stream, but as ranefs to account for non-independence of obs within each stream but not to compare average effects. Include as both when you want population-level estimates AND when you expect stream-level random variation around those means. Do not include both when intercepts are redundant (~ stream + (1| stream)), or too few levels for the random effect to be estimated (e.g., <5 levels).
| Variable | Fixed? | Random? | Why |
|---|---|---|---|
| COMID | No | Yes (but must check data availability among levels; likely sparse) | Not estimating individual COMID effects, just accounting for correlation. |
| Stream | Yes | Maybe (only if random slope or complex structure) | May want to estimate differences and account for grouping. |
| Year | Yes (comparing years) | Maybe (account for inter-annual variability) | Use one or the other depending on goal. |
In this case, it makes sense to include year as fixed because:
On our first couple passes, we used models with renefs, but were observing strong overfitting. Here’s what’s likely going wrong with the mixed-effects model:
High AIC-based model performance but stream-level predictions far from observed data , so random effects absorbing noise or overfitting sparse groups. Random intercepts for COMID dominate variance becuase many COMIDs have only 1 observation → intercepts become noise. Unequal data across years and streams means random intercepts get misled by imbalance, especially for sparse years (like 2005). Random slopes and interactions fail to improve fit because not enough data per group to support slope variation.
A simple lm may be better here because it treats stream and year as explicit, estimable fixed effects, which gives you real, interpretable estimates for group means. It doesn’t try to “guess” partial-pooling intercepts or slopes where data are lacking. The model becomes transparent — predictions reflect what’s in the data, not what the model infers from structure.
Let’s interrogate the grouping structure within the data to make a final decision.
| stream | n_COMIDs |
|---|---|
| Big | 21 |
| Loon | 19 |
| Camas | 16 |
| Marsh | 13 |
| Bear Valley | 11 |
| Sulphur | 11 |
| Beaver | 7 |
| Elk | 6 |
There are at least 6 COMIDs per stream, and the number of observations per COMID is highly variable. The histogram shows that most COMIDs have 1-2 observations, but some have many more. 23 COMIDs have fewer than 5 observations.
Are the enough COMIDs per stream to consider stream/COMID nested random effects? No.
Are groups well sampled? (Do most COMIDs have >1-2 redds?) No. 23 COMIDs have <5 redds (25%), 13 have <= 2. (12.5%) With <5 obs/level, variance estimates become unstable -> overfit and absorb noise (low AIC / high R2) -> singular fits.
Are year or stream-level random effects justifiable? We want to compare streams and years in this dataset, not generalize beyond them, so we should use fixed effects. Further, stream are known, of interest, and were not randomly sampled. Including (1 | stream) would be statistically redundant (stream intercepts would be double-modeled) and would lead to misleading AIC comparisons.
Random slopes?. The needs multiple obs per group across a range of slope variable (temp_90), enough replication to estimate variation in slopes, not just intercepts. Need ~8+ obs per group spread across the covariate. We could do this for stream or year. Prob not.
So, we should use a simple linear model with no random effects. This allows for interpretable results, no overfitting from random effects, and is a reasonable approach given the design.
If we did try (1 | COMID), expect singularity, low variance estimates for COMID, and predictions will likley miss the groups means.
## df AIC delta
## m_rand 14 13541.11 0.000
## m_fixed 13 16245.22 2704.107
Does adding (1|COMID) improve it without overfiitg? Improves AIC by >2700. Is the variance explained by COMID non-zero and meaningful? Variance is 60 and likely meaninful. But, is it overfitting? Yes, the model is overfitting. The AIC is much lower, but the predictions are not close to the observed data (Not shown).
| Interaction | Description | Possible Interpretation |
|---|---|---|
temp_90 × stream |
Temperature effects on spawn timing vary across streams | Some streams may be more thermally sensitive (e.g., due to groundwater, shading, etc.) |
temp_90 × year |
Interannual differences in how temperature affects spawn timing | Annual climate conditions (e.g., snowmelt timing, baseflow) modify temp-timing response |
stream × year |
Stream-specific interannual variation | Some streams respond more strongly to wet/dry or hot/cool years |
mean_elevation × stream |
Elevation effects on timing vary by stream | Elevation proxies snowpack or gradient, which may matter differently by stream |
mean_elevation × year |
Elevation effects change across years | Snowmelt timing shifts may be more impactful in some years |
temp_90 × mean_elevation |
Thermal effects depend on elevation | High-elevation streams may respond more strongly or weakly to the same temp range |
The most plausible core interactions to include a prior are: -
temp_90 × stream - temp_90 × year
Others like stream x year are data hungry and may
overfit without strong replication. Interactions with
mean_elevation does not make much sense.
We used AIC-based model selection to identify the best-supported linear model that included temp_90, mean_elevation, stream, year, and all two-way interactions. This model outperformed all simpler alternatives by a large AIC margin. We then tested an alternative model including a raw quadratic term (I(temp_90^2)) to account for potential nonlinearity in the temperature–spawn timing relationship. Model performance was evaluated using AIC and visual assessment of stream- and year-specific predictions.
First we’ll fit simple linear models to get a sense of the data and the relationships. Adding new interactions each time.
# m0 <- lmer(yday ~ temp_90 + stream + year + (1 | COMID), data = model_data_final)
m1 <- lm(yday ~ temp_90 + mean_elevation + stream + year, data = model_data_final)
m2 <- lm(yday ~ temp_90 * stream + temp_90 + mean_elevation + stream + year, data = model_data_final)
m3 <- lm(yday ~ temp_90 * stream + temp_90 * year + temp_90 + mean_elevation + stream + year, data = model_data_final)
m4 <- lm(yday ~ temp_90 * stream + temp_90 * year + stream * year + temp_90 + mean_elevation + stream + year, data = model_data_final)
AIC(m1, m2, m3, m4) |>
arrange(AIC) |>
mutate(delta = AIC - min(AIC))
## df AIC delta
## m4 43 14657.31 0.0000
## m3 24 15022.20 364.8860
## m2 21 15242.00 584.6876
## m1 14 15933.83 1276.5133
Full interaction (m4) is best. We will now compare all possible combinations of the interactions, except stream x year.
# global_model <- lm(yday ~ temp_90 * stream + temp_90 * year + stream * year + temp_90 + mean_elevation + stream + year, data = model_data_final)
global_model <- lm(yday ~ temp_90 * stream + temp_90 * year + temp_90 + stream + year + mean_elevation, data = model_data_final)
# Run dredge
options(na.action = "na.fail") # Required for dredge
model_set <- dredge(global_model, trace = 1, rank = "AIC")
model_set
Model 64 with all 2-way interactions (save stream x year) is the best supported model based on AIC.
NOTE ELEVATION IS THE ISSUE CAUSING OFFSETS!!!! The best fitting model without elevation in the full model with all 2-way interactions. When plotting predictions using the model without elevation, the linear offsets are no more. This makes sense because the estimated coefficients for models with elevation are positive, which does not align with the observed data:
final_model <- get.models(model_set, 1)[[1]]
tmp <- ggeffects::ggpredict(final_model, terms = c("mean_elevation"))
ggplot(tmp, aes(x = x, y = predicted)) +
geom_line(size = 1.1) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2, color = NA) +
geom_point(data = model_data_final, aes(x = mean_elevation, y = yday), size = .5) +
theme_custom() +
labs(title = "Elevation effect on spawn timing",
subtitle = "Note the strong mismatch between predicted trend and observed data",
x = "Mean Elevation (m)", y = "Spawn Day of Year")
So we will select the best fitting model without elevation (model 127; AIC rank = 6)
final_model <- get.models(model_set, 4)[[1]]
Next we will fit a new best + quadratic model.
## df AIC delta
## lm_poly 24 15521.32 0.0000
## final_model 23 15731.60 210.2749
The poly model improves AIC by ~210, major improvement. So the curvature is helping.
# get preds for each model
model_data_final$pred_linear <- predict(final_model, newdata = model_data_final)
model_data_final$pred_quad <- predict(lm_poly, newdata = model_data_final)
# compute error metrics by stream and year
model_error_summary <- model_data_final %>%
mutate(error_linear = abs(yday - pred_linear),
error_quad = abs(yday - pred_quad)) %>%
group_by(stream, year) %>%
summarize(
n = n(),
MAE_linear = mean(error_linear),
MAE_quad = mean(error_quad),
RMSE_linear = sqrt(mean((yday - pred_linear)^2)),
RMSE_quad = sqrt(mean((yday - pred_quad)^2)),
.groups = "drop"
)
# Long format
model_error_long <- model_error_summary %>%
pivot_longer(cols = starts_with("RMSE"), names_to = "model", values_to = "RMSE") %>%
mutate(model = ifelse(model == "RMSE_linear", "Linear", "Quadratic"))